Maj Jason Freels PhD
29 January 2020
You may have heard one or more of the following terms
Machine learning is a catch-all term for these things
Despite being broad, most machine learning algorithms fit into two categories
Supervised learning algorithms
Unsupervised learning algorithms
Example applications of regression algorithms
The following algorithms can be used to model (describe) the relationship between inputs and outputs when the outputs are numeric and continuous
Classification is focused on predicting/labeling a discrete output (categories)
There can be more than two categories (not just yes/no)
Example applications of classification algorithms
The following algorithms can be used to model (describe) the relationship between inputs and outputs when the outputs are discrete or categorical
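To make “predicting a discrete output” concrete, here is a minimal sketch of one simple classifier - a nearest-centroid rule. Both the toy data and the choice of algorithm are illustrative assumptions, not from the lecture:

```python
import numpy as np

# Toy training data: two numeric inputs per observation,
# each labeled with one of two categories (0 or 1)
X_train = np.array([[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],   # class 0
                    [3.0, 3.2], [2.8, 3.0], [3.1, 2.9]])  # class 1
y_train = np.array([0, 0, 0, 1, 1, 1])

def nearest_centroid_predict(X_train, y_train, x_new):
    """Label x_new with the class whose mean (centroid) is closest."""
    labels = np.unique(y_train)
    centroids = np.array([X_train[y_train == c].mean(axis=0) for c in labels])
    distances = np.linalg.norm(centroids - x_new, axis=1)
    return labels[np.argmin(distances)]

print(nearest_centroid_predict(X_train, y_train, np.array([0.9, 1.1])))  # → 0
print(nearest_centroid_predict(X_train, y_train, np.array([3.0, 3.0])))  # → 1
```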
Before moving forward I want to make sure that you have a solid (yet high level) understanding of how supervised learning algorithms work under the hood
All supervised learning algorithms have the following elements
Suppose you’re asked to create a model to describe the relationship between one set of inputs and one set of outputs
Call the set of inputs x and the set of outputs y, shown in the figure below
Plot of some ideal data
Observing the figure, there doesn’t appear to be any uncertainty in the data as each point falls on a straight line
An obvious choice for a model to describe this data would be a function of the form \(y = mx +b\) - the familiar equation for a line
Adding a plot of the line \(y(x) = mx+b\) shows that a “perfect” model exists for our ideal data
Fitting the ideal data with a “perfect” model
Now that we’ve chosen a functional form for \(f_{\text{imperfect}}\) we turn our attention to the parameters of the model
Every model comes with parameters – the values assigned to these parameters affect how well \(f_{\text{imperfect}}\) represents the relationship between x and y
For our chosen \(f_{\text{imperfect}}\) we see that there are two parameters: the slope (\(m\)) and the intercept (\(b\))
The question, of course, is: what are the correct values of the slope \(m\) and the intercept \(b\) that best represent the relationship between x and y?
For this ideal data set, we can determine these values using our knowledge of straight lines and our ability to read a plot - the value of the intercept can be read directly from the plot as \(b = 3\)
Using this value for \(b\) we can choose any of the data points and solve for the slope \(m = 5\)
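That arithmetic can be checked in code; a small sketch assuming the ideal data \(y = 5x + 3\) used later in this lecture:

```python
import numpy as np

x = np.arange(0, 6, 0.5)  # the ideal inputs used throughout this lecture
y = x * 5 + 3             # ideal outputs: every point falls on the line

b = y[x == 0][0]          # intercept: the output where the input is zero
m = (y[2] - b) / x[2]     # slope: rise over run using any other point

print(m, b)  # → 5.0 3.0
```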
I know what you’re thinking: What does this have to do with machine learning?
A loss function measures how well a proposed model describes the relationship between x and y, having constrained this relationship to the form of \(f_{\text{imperfect}}\) that we chose above
Loss functions are a key component in all supervised learning algorithms
In many cases, you don’t have to choose the loss function used to find the optimal parameter values
Rather, it comes as part of a package deal with the modeling approach you choose (i.e. linear regression, logistic regression, etc.).
You can, however, come up with your own loss function - so long as it produces meaningful results
For our perfect example data, we can choose among several different loss functions
However, not all of them will return an accurate solution
In the sections below I walk through the choice of several different loss functions and plot the results
In this case we use a naive loss function that represents the difference between the observed output and the output returned by the proposed model
The loss function is expressed as shown below
\[ Loss_{_{naive}}(\mathbf{y},\mathbf{x},m,b) = \sum_{i=1}^N y_i-m\times x_i-b. \]
Using this function, loss would simply be defined as the sum of the vertical distances between each observed output \(y_i, i = 1,\ldots,N\) and the output returned by the chosen model
The parameters \(m\) and \(b\) for the best-fit line correspond to the model that has the minimum loss.
For our “ideal” data, the points fall on a straight line and we would expect the loss value in this case to be zero
Thus far, we’ve chosen a functional form that we believe is a good representation of the data – and a corresponding loss function
In the chunk below we define our naive loss function
def naive_loss(params, x, y):
    if len(params) != 2: sys.exit("Need 2 parameters -- dummy!")
    m = params[0]
    b = params[1]
    loss = 0
    for i in range(len(y)):
        loss = loss + (y[i] - m * x[i] - b)
    return(loss)

We then use scipy’s minimize() function to find the values of \(m\) and \(b\) that minimize the loss function and result in a model that best fits the data

from scipy.optimize import minimize
import numpy as np
import sys
x = np.arange(0,6,0.5)
y = x * 5 + 3
res = minimize(naive_loss,
               x0 = [0,0],
               args = (x,y))
res

     fun: -12693503877.002071
hess_inv: array([[1, 0],
[0, 1]])
jac: array([0., 0.])
message: 'Optimization terminated successfully.'
nfev: 64
nit: 1
njev: 16
status: 0
success: True
x: array([3.39728820e+08, 1.23537753e+08])
Looking at these results, it’s clear that something isn’t right - why?
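A quick numerical probe suggests the answer: because the terms of the naive loss are signed, there is no minimum to find - extreme parameter values drive the loss toward negative infinity. The chunk below (re-defining the loss and data so it stands alone) demonstrates this:

```python
import numpy as np

# Re-define the naive loss and the ideal data so this chunk stands alone
def naive_loss(params, x, y):
    m, b = params
    return sum(y[i] - m * x[i] - b for i in range(len(y)))

x = np.arange(0, 6, 0.5)
y = x * 5 + 3

print(naive_loss([5, 3], x, y))      # the true parameters give a loss of 0
print(naive_loss([1e6, 1e6], x, y))  # extreme parameters give a hugely negative loss
```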
\[ Loss_{_{absolute}}(\mathbf{y},\mathbf{x},m,b) = \sum_{i=1}^N \Big\vert y_i-m\times x_i-b\Big\vert. \]
def absolute_loss(params, x, y):
    if len(params) != 2: sys.exit("Need 2 parameters -- dummy!")
    m = params[0]
    b = params[1]
    loss = 0
    for i in range(len(y)):
        loss = loss + np.abs(y[i] - m * x[i] - b)
    return(loss)

We then use scipy’s minimize() function to find the values of \(m\) and \(b\) that minimize the loss function and result in a model that best fits the data

from scipy.optimize import minimize
import numpy as np
import sys
x = np.arange(0,6,0.5)
y = x * 5 + 3
res = minimize(absolute_loss,
               x0 = [0,0],
               args = (x,y))
res

     fun: 3.135918174024255e-07
hess_inv: array([[ 2.22446360e-09, -1.33536208e-08],
[-1.33536208e-08, 8.83478893e-08]])
jac: array([ 2.31926751, -1.050596 ])
message: 'Desired error not necessarily achieved due to precision loss.'
nfev: 471
nit: 20
njev: 115
status: 2
success: False
x: array([4.99999998, 3.00000003])
The problem is that the naive loss is unconstrained - because its terms are signed, the loss can be driven toward negative infinity by making \(m\) and \(b\) arbitrarily large
The absolute loss fixes this, but it is not differentiable everywhere, which is why the optimizer stops with a precision warning
A better option would be to propose a loss function that is convex and smooth, such as
\[ Loss_{_{convex}}(\mathbf{y},\mathbf{x},m,b) = \sum_{i=1}^N \Big( y_i-m\times x_i-b\Big)^2. \]
Note that we are minimizing the squared distances between the observed values and the proposed model - hence this approach is called least squares
For the last time let’s define our convex loss function
def convex_loss(params, x, y):
    if len(params) != 2: sys.exit("Need 2 parameters -- dummy!")
    m = params[0]
    b = params[1]
    loss = 0
    for i in range(len(y)):
        loss = loss + (y[i] - m * x[i] - b)**2
    return(loss)

We then use scipy’s minimize() function to find the values of \(m\) and \(b\) that minimize the loss function and result in a model that best fits the data

from scipy.optimize import minimize
import numpy as np
import sys
x = np.arange(0,6,0.5)
y = x * 5 + 3
res = minimize(convex_loss,
               x0 = [0,0],
               args = (x,y))
res

     fun: 1.81798298894111e-14
hess_inv: array([[ 0.01398601, -0.03846154],
[-0.03846154, 0.1474359 ]])
jac: array([-8.88152632e-07, -7.45724819e-07])
message: 'Optimization terminated successfully.'
nfev: 28
nit: 5
njev: 7
status: 0
success: True
x: array([5. , 2.99999997])
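As a sanity check on the optimizer, the squared loss has a closed-form minimizer - the ordinary least-squares estimates. The sketch below (same data as above) shows the analytic answer agrees with the numerical one:

```python
import numpy as np

x = np.arange(0, 6, 0.5)
y = x * 5 + 3

# Ordinary least-squares estimates in closed form:
#   m = cov(x, y) / var(x),   b = mean(y) - m * mean(x)
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b = y.mean() - m * x.mean()

print(m, b)  # → 5.0 3.0, matching the optimizer's answer
```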
Simple approach for supervised learning
A useful and widely used statistical learning method for predicting a quantitative response
Serves as a good jumping-off point for newer approaches:
Many statistical learning algorithms are generalizations or extensions of linear regression
Important to understand linear regression before studying more complex learning algorithms
This lecture serves as an introduction to simple linear regression and will cover the following topics
In a subsequent lecture we’ll extend this to discuss the following
import random
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
import scipy as sp
import statsmodels.api as sm
import statistics as st

This lecture uses the advertising data set made available by the authors of the text An Introduction to Statistical Learning
The data set contains four columns of advertising-related features captured across 200 markets
TV: Money spent on television advertising (in thousands of dollars)
radio: Money spent on radio advertising (in thousands of dollars)
newspaper: Money spent on newspaper advertising (in thousands of dollars)
Sales: return on the advertising investment (in dollars)
The questions to be addressed by analyzing this data set: Is there a relationship between Sales and the way that advertising dollars are spent? Does TV advertising impact Sales? Does radio advertising impact Sales? Does newspaper advertising impact Sales?
To access the data, we’ll extract it from the CSV file and store it as a pandas DataFrame df using read_csv()
data_url = "http://faculty.marshall.usc.edu/gareth-james/ISL/Advertising.csv"
df = pd.read_csv(data_url)

   Unnamed: 0     TV  radio  newspaper  sales
0 1 230.1 37.8 69.2 22.1
1 2 44.5 39.3 45.1 10.4
2 3 17.2 45.9 69.3 9.3
3 4 151.5 41.3 58.5 18.5
4 5 180.8 10.8 58.4 12.9
The Unnamed: 0 column is just a row index that we don’t need - instead, use the usecols argument to select which columns we want to extract
Note that we supply these column names as a set using { } (a list [ ] works as well)
df2 = pd.read_csv(data_url,
                  usecols = {'TV','newspaper','radio','sales'})
TV radio newspaper sales
0 230.1 37.8 69.2 22.1
1 44.5 39.3 45.1 10.4
2 17.2 45.9 69.3 9.3
3 151.5 41.3 58.5 18.5
4 180.8 10.8 58.4 12.9
Initial discovery of relationships is usually done with a training set while a test set is used for evaluating whether the discovered relationships hold
We’ll use a 60% / 40% split to partition the data set
rows = range(df2.shape[0])
train_rows = random.sample(rows,
                           int(0.6 * len(rows)))
test_rows = list(set(rows) - set(train_rows))
train_data, test_data = df2.iloc[train_rows], df2.iloc[test_rows]
train_data.shape
(120, 4)
test_data.shape
(80, 4)
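One caveat with the chunk above: random.sample() draws a different partition on every run. If reproducibility matters (an assumption on my part - the lecture doesn’t address this), seed the generator first; a self-contained sketch with stand-in data:

```python
import random
import pandas as pd

# Hypothetical stand-in for df2: a 200-row frame with the same four columns
df2 = pd.DataFrame({"TV": range(200), "radio": range(200),
                    "newspaper": range(200), "sales": range(200)})

random.seed(42)  # fix the seed so the same rows land in each partition every run
rows = range(df2.shape[0])
train_rows = random.sample(rows, int(0.6 * len(rows)))
test_rows = list(set(rows) - set(train_rows))
train_data, test_data = df2.iloc[train_rows], df2.iloc[test_rows]

print(train_data.shape, test_data.shape)  # → (120, 4) (80, 4)
```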
Simple linear regression is a straightforward approach for predicting a quantitative response \(Y\) on the basis of a single predictor variable \(X\).
Assumes there is an approximately a linear relationship between \(X\) and \(Y\)
Using the advertising data, suppose we wish to model the relationship between the TV budget and sales
To build a simple regression model in Python we use the OLS() (ordinary least squares) function from the api module in the statsmodels library
In matrix notation, a linear regression model is expressed as
\[ \begin{aligned} \boldsymbol{y} &= \beta_0 + \beta_{1}\boldsymbol{X} + \epsilon\\\\ \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{m} \end{bmatrix} &= \begin{bmatrix} 1& X_{1}\\ 1& X_{2}\\ \vdots & \vdots \\ 1& X_{m} \end{bmatrix} \begin{bmatrix} \beta_{0} \\ \beta_{1} \end{bmatrix}+ \begin{bmatrix} \epsilon_{1} \\ \epsilon_{2} \\ \vdots \\ \epsilon_{m} \end{bmatrix} \end{aligned} \]
where \(\beta_0\) is the intercept, \(\beta_1\) is the slope coefficient, and \(\epsilon\) is a random error term
To build this model, we first need to define the \(X\)-matrix of inputs and the \(y\) vector of responses from the training data - this is done below
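The matrix equation above maps directly onto code. Below is a numpy-only sketch using made-up data (not the advertising set); statsmodels’ sm.add_constant() and OLS() perform the equivalent steps:

```python
import numpy as np

# Stand-in data (illustrative): outputs generated by y = 7 + 0.05 * x
x = np.array([10.0, 50.0, 100.0, 150.0, 200.0])
y = 7.0 + 0.05 * x

# Build the X-matrix from the equation above: a column of 1s (for the
# intercept beta_0) next to the column of inputs (for the slope beta_1)
X = np.column_stack([np.ones_like(x), x])

# Solve for the beta vector by least squares
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # → approximately [7.0, 0.05]
```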
Building a linear regression model is only half of the work
To use the model you should check that it conforms to the assumptions of linear regression
The first assumption is that the model is linear in the parameters - this just means that no functions are being applied to the \(\beta\) parameters
By simply viewing our model we see that this assumption is met
The result of a simple linear regression model is a line of best fit
The data points usually don’t fall exactly on this regression equation line
A residual is the vertical distance between a data point and the regression line
For each data point we get one residual
We want to see that the residuals are approximately equally distributed about zero
We can check to see if the mean of the residuals is in fact zero by extracting the residuals from the model and then taking the mean of these values
This is done in the code chunk below - we see that the result is a value that, numerically speaking, is zero
-4.51490696680897e-16
plot = sb.residplot(model_fit.fittedvalues,
                    train_data.columns[3],
                    data=train_data,
                    lowess=True,
                    scatter_kws={'alpha': 0.5},
                    line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})
plot.set_title('Residuals vs Fitted')
plot.set_xlabel('Fitted values')
plot.set_ylabel('Residuals');
plt.show()

This is typically verified through graphical means by inspecting the residual plot
We want to see that there’s no shape in residuals moving from right to left - essentially a rectangular cloud of points
Observing the residual plot for this data set we see that there is a distinct funnel shape where the variance increases as we move from left to right - this is not good
Often we implement a transformation on either the predictor (input) variable or the response (output) variable (or both) to ensure that this assumption is met
In the code chunk below you can explore different types of transformations on both of these variables to see how this impacts the shape of the residual plot
df2 = pd.read_csv(data_url,
                  usecols = {'TV','newspaper','radio','sales'})
## apply a transformation on the TV column
df2["TV"] = (df2["TV"])**2
## apply a transformation on the sales column
df2["sales"] = np.sqrt(df2["sales"])
train_data, test_data = df2.iloc[train_rows], df2.iloc[test_rows]
X = train_data[["TV"]]
y = train_data["sales"]
X = sm.add_constant(X)
model = sm.OLS(y, X)
model_fit = model.fit()
## model residuals
model_residuals = model_fit.resid
st.mean(model_residuals)
plot_lm = plt.figure()
plot_lm.axes[0] = sb.residplot(model_fit.fittedvalues,
                               train_data.columns[3],
                               data=train_data,
                               lowess=True,
                               scatter_kws={'alpha': 0.5},
                               line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})
plot_lm.axes[0].set_title('Residuals vs Fitted')
plot_lm.axes[0].set_xlabel('Fitted values')
plot_lm.axes[0].set_ylabel('Residuals');
plt.show()

This is another assumption that is often checked via graphical methods
We are looking to verify that the residuals are normally distributed by plotting them on what is known as a quantile-quantile plot (or QQ plot)
from statsmodels.graphics.gofplots import ProbPlot
model_norm_residuals = model_fit.get_influence().resid_studentized_internal
QQ = ProbPlot(model_norm_residuals)
plot_lm_2 = QQ.qqplot(line='45', alpha=0.5, color='#4C72B0', lw=1)
plot_lm_2.axes[0].set_title('Normal Q-Q')
plot_lm_2.axes[0].set_xlabel('Theoretical Quantiles')
plot_lm_2.axes[0].set_ylabel('Standardized Residuals');
# annotations
abs_norm_resid = np.flip(np.argsort(np.abs(model_norm_residuals)), 0)
abs_norm_resid_top_3 = abs_norm_resid[:3]
for r, i in enumerate(abs_norm_resid_top_3):
    plot_lm_2.axes[0].annotate(i,
                               xy=(np.flip(QQ.theoretical_quantiles, 0)[r],
                                   model_norm_residuals[i]));
plt.show()

The code in the chunk below is used to generate a summary of the model we built earlier
Below we describe what several of these values mean
OLS Regression Results
==============================================================================
Dep. Variable: sales R-squared: 0.595
Model: OLS Adj. R-squared: 0.591
Method: Least Squares F-statistic: 173.2
Date: Wed, 29 Jan 2020 Prob (F-statistic): 6.80e-25
Time: 12:50:22 Log-Likelihood: -311.21
No. Observations: 120 AIC: 626.4
Df Residuals: 118 BIC: 632.0
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 7.2470 0.599 12.101 0.000 6.061 8.433
TV 0.0459 0.003 13.159 0.000 0.039 0.053
==============================================================================
Omnibus: 0.520 Durbin-Watson: 2.131
Prob(Omnibus): 0.771 Jarque-Bera (JB): 0.511
Skew: -0.153 Prob(JB): 0.775
Kurtosis: 2.907 Cond. No. 345.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
X_new = test_data["TV"]
X_new = sm.add_constant(X_new)
# make the predictions by the model
test_data["predictions"] = model_fit.predict(X_new)
test_data

        TV  radio  newspaper  sales  predictions
1 44.5 39.3 45.1 10.4 9.289676
3 151.5 41.3 58.5 18.5 14.201343
9 199.8 2.6 21.2 10.6 16.418479
14 204.1 32.9 46.0 19.0 16.615864
15 195.4 47.7 52.9 22.4 16.216504
.. ... ... ... ... ...
188 286.0 13.9 3.7 15.9 20.375356
189 18.7 12.1 23.4 6.7 8.105367
190 39.5 41.1 5.8 10.8 9.060158
191 75.5 10.8 6.0 9.9 10.712682
195 38.2 3.7 13.8 7.6 9.000484
[80 rows x 5 columns]